This bike sharing dataset (hour.csv) was obtained from UCI machine learning repository. Below is information of the dataset extracted and modified from the included “Readme.txt” :


Background


Bike sharing systems are a new generation of traditional bike rentals where the whole process from membership, rental and return back has become automatic. Through these systems, the user is able to easily rent a bike from a particular position and return back to another position. Currently, there are about over 500 bike-sharing programs around the world which are composed of over 500 thousand bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.


Dataset


The bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors. The core data set is related to the two-year historical log corresponding to years 2011 and 2012 from Capital Bikeshare system, Washington D.C.


Dataset Characteristics


hour.csv

- instant: record index
- dteday : date
- season : season (1:springer, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : weather day is holiday or not
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
+ weathersit : 
    - 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    - 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided to 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided to 50 (max)
- hum: Normalized humidity. The values are divided to 100 (max)
- windspeed: Normalized wind speed. The values are divided to 67 (max)
- casual: count of bike rented bycasual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

Objective


Prediction of hourly bike rental count based on the environmental and seasonal settings.

Let’s read the input and take a look at its structure.

# Read
hour.data <- read.csv("hour.csv", header= TRUE,stringsAsFactors = FALSE)

# Overview
str(hour.data)
## 'data.frame':    17379 obs. of  17 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weathersit: int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
##  $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
##  $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
##  $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...
head(hour.data)
##   instant     dteday season yr mnth hr holiday weekday workingday
## 1       1 2011-01-01      1  0    1  0       0       6          0
## 2       2 2011-01-01      1  0    1  1       0       6          0
## 3       3 2011-01-01      1  0    1  2       0       6          0
## 4       4 2011-01-01      1  0    1  3       0       6          0
## 5       5 2011-01-01      1  0    1  4       0       6          0
## 6       6 2011-01-01      1  0    1  5       0       6          0
##   weathersit temp  atemp  hum windspeed casual registered cnt
## 1          1 0.24 0.2879 0.81    0.0000      3         13  16
## 2          1 0.22 0.2727 0.80    0.0000      8         32  40
## 3          1 0.22 0.2727 0.80    0.0000      5         27  32
## 4          1 0.24 0.2879 0.75    0.0000      3         10  13
## 5          1 0.24 0.2879 0.75    0.0000      0          1   1
## 6          2 0.24 0.2576 0.75    0.0896      0          1   1

Split data into training and testing datasets for applying models.

train <- hour.data[as.integer(substr(hour.data$dteday,9,10)) < 22, ]
test <- hour.data[as.integer(substr(hour.data$dteday,9,10)) > 21, ]

# Training: 69.2% 
nrow(train)/ nrow(hour.data)
## [1] 0.6924449
# Testing: 30.7%
nrow(test)/ nrow(hour.data)
## [1] 0.3075551

Model Selection


GOAL: Apply different models to find predictive results of hourly total rental of a day

Technical explanation: For each model built upon training dataset, there will be predictive values against actual values of the testing dataset. The measurement used here is mean((y - yhat)^2) i.e. Mean Squared Error(MSE). We are going to apply multiple models and figure out the one that minimizes MSE. The total rental (cnt) of a day is the sum of registered users (registered) and casual users (casual). After trying different combinations, we found out that using 2 separate models to predict the number of registered users and casual users yield better result than using a single model to predict total rentals (cnt)

The followings are models we have applied:

  1. Neural Networks (NN)
  2. Linear Regression
  3. Support Vector Machine (SVM)
  4. Random Forest
  5. Gereralized Boosting Model (GBM)
  6. Regression Tree

The results for each model are “cnt.MSE”, “combined.MSE”, “registered.MSE”, “casual.MSE”

# Load packages
library(nnet)
library(ggplot2)
library(ggthemes)
library(gbm)
library(randomForest)
library(e1071)
library(rpart)

First, get data ready. Before factorizing some of the attributes, we leave numeric variables as they are for Neural Networks.


(1) Neural Networks

Step1: Orignal model

# Orignal Model
neural.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
neural.model = nnet(neural.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  281
## initial  value 838926398.354622 
## final  value 394926965.983082 
## converged
testset=subset(test, select = c("season", "yr" , "mnth" , "hr" , "holiday" , "weekday" , "workingday" , "weathersit" , "temp" , "atemp" , "hum" , "windspeed"))

neural.cnt = predict(neural.model, testset, type="raw")

test$neural.cnt=neural.cnt
# Compute MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 33104.73
# Plot to check result
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result

# Change negative result to positive
test$neural.cnt[test$neural.cnt < 0] = 0
# Compute new MSE
neural.MSE = sum((test$cnt - test$neural.cnt)^2)/nrow(test)
neural.MSE
## [1] 33104.73
neural.result = ggplot(test,aes(cnt,neural.cnt))+geom_point()
neural.result

Step2: Sepeate models for registered and casual

# Model for Registered
# Take off weedspeed and holiday yeild to the best result
neural.registered.formula = registered~season + yr + mnth + hr + weekday + workingday+ weathersit + temp + atemp + hum
neural.model.registered = nnet(neural.registered.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  241
## initial  value 559079226.792752 
## iter  10 value 244122412.256148
## iter  20 value 198835761.212141
## iter  30 value 172241855.672617
## iter  40 value 159896361.855667
## iter  50 value 149723367.246861
## iter  60 value 143958105.397097
## iter  70 value 140582206.518999
## iter  80 value 135858127.292951
## iter  90 value 134444900.342117
## iter 100 value 131076044.824574
## iter 110 value 129003257.450227
## iter 120 value 126803272.430574
## iter 130 value 125148403.082232
## iter 140 value 123765072.112062
## iter 150 value 122833001.124674
## iter 160 value 122420570.457014
## iter 170 value 122146540.729700
## iter 180 value 121508383.111550
## iter 190 value 120797501.212617
## iter 200 value 119289090.475980
## iter 210 value 117283661.656061
## iter 220 value 116928914.257144
## iter 230 value 116455547.924404
## iter 240 value 116196989.513513
## iter 250 value 116006729.517431
## iter 260 value 115853339.560867
## iter 270 value 115744019.563024
## iter 280 value 115564123.052955
## iter 290 value 114046266.827014
## iter 300 value 113380129.362451
## iter 310 value 113099404.896490
## iter 320 value 112622887.171588
## iter 330 value 112207871.221034
## iter 340 value 111974808.939365
## iter 350 value 111644077.477575
## iter 360 value 111366566.970739
## iter 370 value 111080020.921324
## iter 380 value 111020365.526848
## iter 390 value 110897315.553646
## iter 400 value 110779600.818873
## iter 410 value 110644639.842047
## iter 420 value 110569880.114434
## iter 430 value 110501473.108163
## iter 440 value 110441533.882779
## iter 450 value 110414691.564901
## iter 460 value 110383885.135484
## iter 470 value 110290044.217288
## iter 480 value 110210184.129649
## iter 490 value 110113569.851962
## iter 500 value 110062291.187289
## iter 510 value 110005160.290864
## iter 520 value 109978847.879917
## iter 530 value 109961610.877135
## iter 540 value 109941394.466741
## iter 550 value 109922017.953852
## iter 560 value 109901834.539828
## iter 570 value 109826683.005086
## iter 580 value 109753915.727100
## iter 590 value 109676437.324137
## iter 600 value 109561225.828613
## iter 610 value 108799917.119944
## iter 620 value 108402966.945694
## iter 630 value 105917123.445489
## iter 640 value 104814292.194800
## iter 650 value 101439097.744597
## iter 660 value 98970212.936261
## iter 670 value 97465910.893124
## iter 680 value 96805237.009146
## iter 690 value 95580666.069254
## iter 700 value 95080036.931786
## iter 710 value 94729714.984066
## iter 720 value 93574507.077844
## iter 730 value 92812232.142079
## iter 740 value 91544940.245097
## iter 750 value 89453453.290026
## iter 760 value 87073511.723972
## iter 770 value 82481155.311038
## iter 780 value 79617302.936259
## iter 790 value 77456387.416356
## iter 800 value 76362693.260844
## iter 810 value 75516901.472948
## iter 820 value 74847652.832301
## iter 830 value 74322434.453099
## iter 840 value 73848367.078481
## iter 850 value 73081127.822422
## iter 860 value 72488329.006398
## iter 870 value 72035374.196297
## iter 880 value 71228895.331938
## iter 890 value 69960770.672154
## iter 900 value 67656451.233689
## iter 910 value 62979954.060001
## iter 920 value 57660771.614261
## iter 930 value 55394785.880783
## iter 940 value 53058737.232892
## iter 950 value 52705536.559385
## iter 960 value 52531554.511734
## iter 970 value 52429550.678954
## iter 980 value 52234374.537650
## iter 990 value 51922074.421280
## iter1000 value 51710030.019903
## iter1010 value 51549289.829091
## iter1020 value 51446092.755695
## iter1030 value 51375060.380095
## iter1040 value 51223718.368806
## iter1050 value 51058149.851851
## iter1060 value 50659691.616179
## iter1070 value 50420438.786517
## iter1080 value 50207396.675964
## iter1090 value 49783403.792431
## iter1100 value 49437857.683047
## iter1110 value 48918477.870983
## iter1120 value 48752747.363887
## iter1130 value 48215506.907311
## iter1140 value 47656183.238526
## iter1150 value 47058193.091308
## iter1160 value 46187839.801406
## iter1170 value 45039720.867068
## iter1180 value 43946970.259374
## iter1190 value 43532922.402782
## iter1200 value 43246679.685316
## iter1210 value 42800341.303212
## iter1220 value 42022339.931764
## iter1230 value 40400217.698716
## iter1240 value 39489599.787395
## iter1250 value 38353772.465059
## iter1260 value 37820524.038655
## iter1270 value 36748161.996172
## iter1280 value 35810227.011449
## iter1290 value 35616556.495050
## iter1300 value 35248877.807562
## iter1310 value 34639320.615514
## iter1320 value 34382316.374729
## iter1330 value 34214160.281367
## iter1340 value 34192328.535653
## iter1350 value 34152339.295509
## iter1360 value 34111114.161009
## iter1370 value 34097937.004546
## iter1380 value 34078436.946578
## iter1390 value 34051061.526745
## iter1400 value 34037571.769580
## iter1410 value 34025135.465457
## iter1420 value 33999130.184042
## iter1430 value 33950323.293723
## iter1440 value 33903986.762989
## iter1450 value 33829933.615261
## iter1460 value 33697518.017423
## iter1470 value 33520626.642584
## iter1480 value 33190208.408215
## iter1490 value 32894186.811921
## iter1500 value 32466036.235305
## iter1510 value 32036000.655080
## iter1520 value 31938142.622689
## iter1530 value 31906610.041876
## iter1540 value 31871882.273023
## iter1550 value 31779252.245909
## iter1560 value 31670905.659265
## iter1570 value 31526407.717114
## iter1580 value 31357983.550165
## iter1590 value 31302805.562268
## iter1600 value 31219840.603303
## iter1610 value 31158047.803200
## iter1620 value 31041234.174500
## iter1630 value 30790626.145853
## iter1640 value 30673098.508733
## iter1650 value 30566799.993406
## iter1660 value 30476549.315130
## iter1670 value 30335209.449961
## iter1680 value 30236122.739286
## iter1690 value 30086903.824584
## iter1700 value 30003822.418413
## iter1710 value 29958724.947062
## iter1720 value 29896820.698914
## iter1730 value 29795225.837558
## iter1740 value 29610809.132116
## iter1750 value 29390604.433224
## iter1760 value 29266915.598379
## iter1770 value 29174840.906512
## iter1780 value 29044946.715798
## iter1790 value 28896258.549168
## iter1800 value 28804354.054914
## iter1810 value 28704166.613556
## iter1820 value 28625558.477181
## iter1830 value 28568733.215553
## iter1840 value 28529991.792531
## iter1850 value 28469317.911875
## iter1860 value 28385640.409149
## iter1870 value 28244681.360978
## iter1880 value 27991582.970112
## iter1890 value 27680652.076570
## iter1900 value 27349407.086989
## iter1910 value 27048744.918593
## iter1920 value 26585907.329297
## iter1930 value 26017838.410745
## iter1940 value 25407932.101723
## iter1950 value 24896642.186017
## iter1960 value 24679219.077776
## iter1970 value 24484155.395033
## iter1980 value 24470652.440568
## iter1990 value 24452754.574735
## iter2000 value 24433932.359369
## iter2010 value 24422593.757120
## iter2020 value 24414327.745760
## iter2030 value 24398835.089358
## iter2040 value 24386417.833843
## iter2050 value 24364964.561560
## iter2060 value 24344364.335329
## iter2070 value 24330941.295770
## iter2080 value 24318483.572976
## iter2090 value 24285606.613261
## iter2100 value 24268950.251871
## iter2110 value 24242930.886654
## iter2120 value 24205612.422663
## iter2130 value 24179846.545917
## iter2140 value 24159477.547740
## iter2150 value 24125680.782277
## iter2160 value 24099422.121075
## iter2170 value 24080010.809248
## iter2180 value 24034384.169264
## iter2190 value 23971444.366787
## iter2200 value 23937069.000504
## iter2210 value 23908727.288832
## iter2220 value 23899393.375325
## iter2230 value 23893490.207669
## iter2240 value 23889889.802172
## iter2250 value 23888723.392403
## iter2260 value 23887665.261134
## iter2270 value 23885947.736077
## iter2280 value 23877638.608157
## iter2290 value 23863749.677693
## iter2300 value 23846211.463911
## iter2310 value 23835682.210673
## iter2320 value 23822754.388774
## iter2330 value 23807289.942873
## iter2340 value 23792139.766723
## iter2350 value 23771462.036692
## iter2360 value 23762597.508772
## iter2370 value 23753368.520960
## iter2380 value 23743302.551354
## iter2390 value 23731956.487801
## iter2400 value 23716496.428334
## iter2410 value 23701454.000702
## iter2420 value 23683188.881635
## iter2430 value 23678827.673857
## iter2440 value 23671115.094533
## iter2450 value 23665062.596193
## iter2460 value 23652990.403843
## iter2470 value 23632827.603202
## iter2480 value 23593735.326086
## iter2490 value 23578637.551862
## iter2500 value 23566109.017511
## iter2510 value 23545639.250071
## iter2520 value 23528554.018431
## iter2530 value 23508663.384315
## iter2540 value 23480807.397958
## iter2550 value 23452509.894150
## iter2560 value 23440202.892575
## iter2570 value 23430094.954674
## iter2580 value 23427965.121837
## iter2590 value 23427383.902601
## iter2600 value 23426502.815646
## final  value 23426257.224004 
## converged
neural.registered = predict(neural.model.registered, testset, type="raw")
test$neural.registered = neural.registered
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3450.674
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result

test$neural.registered[test$neural.registered < 0] = 0
neural.registered.MSE = sum((test$neural.registered-test$registered)^2)/nrow(test)
neural.registered.MSE
## [1] 3407.284
neural.registered.result = ggplot(test,aes(registered,neural.registered))+geom_point()
neural.registered.result

# Model for Casual
neural.casual.formula = casual~season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
neural.model.casual = nnet(neural.casual.formula, train, size=20, maxit=5000, linout=T, decay=0.01)
## # weights:  281
## initial  value 47551479.555417 
## iter  10 value 25785482.540706
## iter  20 value 21236529.317735
## iter  30 value 17849897.644603
## iter  40 value 16427977.929016
## iter  50 value 14288098.905237
## iter  60 value 12579896.951259
## iter  70 value 11675262.602377
## iter  80 value 10167238.524467
## iter  90 value 9551156.252311
## iter 100 value 9421502.681934
## iter 110 value 9236508.832616
## iter 120 value 9080640.239937
## iter 130 value 8902699.195465
## iter 140 value 8740091.350766
## iter 150 value 8610479.549960
## iter 160 value 8484676.930354
## iter 170 value 8396722.218094
## iter 180 value 8275044.225531
## iter 190 value 8175028.024997
## iter 200 value 8114694.560955
## iter 210 value 8003006.674698
## iter 220 value 7904862.542683
## iter 230 value 7827144.777516
## iter 240 value 7762265.835883
## iter 250 value 7691956.707542
## iter 260 value 7627458.830688
## iter 270 value 7564236.767834
## iter 280 value 7548312.857019
## iter 290 value 7537651.168086
## iter 300 value 7522748.368193
## iter 310 value 7509174.845369
## iter 320 value 7498216.730831
## iter 330 value 7486523.181751
## iter 340 value 7477661.432836
## iter 350 value 7457257.605282
## iter 360 value 7438923.777386
## iter 370 value 7419541.469348
## iter 380 value 7394444.293609
## iter 390 value 7368713.569773
## iter 400 value 7321853.895909
## iter 410 value 7244135.077540
## iter 420 value 7140991.581827
## iter 430 value 7036358.203874
## iter 440 value 6972381.298790
## iter 450 value 6914321.686193
## iter 460 value 6856272.944912
## iter 470 value 6817522.541173
## iter 480 value 6785423.653557
## iter 490 value 6762828.412867
## iter 500 value 6725935.445146
## iter 510 value 6695590.541968
## iter 520 value 6658369.031018
## iter 530 value 6608469.186169
## iter 540 value 6525582.338790
## iter 550 value 6417161.760841
## iter 560 value 6324413.080046
## iter 570 value 6206249.439500
## iter 580 value 6033320.303890
## iter 590 value 5927730.524158
## iter 600 value 5838106.165731
## iter 610 value 5753023.211335
## iter 620 value 5709127.001583
## iter 630 value 5670781.798124
## iter 640 value 5669488.394265
## iter 650 value 5665231.904919
## iter 660 value 5658400.628990
## iter 670 value 5655462.798749
## iter 680 value 5648925.261073
## iter 690 value 5644460.945519
## iter 700 value 5639574.943211
## iter 710 value 5635765.036584
## iter 720 value 5626310.039263
## iter 730 value 5614392.997858
## iter 740 value 5600642.505578
## iter 750 value 5579107.501616
## iter 760 value 5557947.864030
## iter 770 value 5545873.643261
## iter 780 value 5535727.602083
## iter 790 value 5519894.589608
## iter 800 value 5512209.300632
## iter 810 value 5497582.341541
## iter 820 value 5481749.423925
## iter 830 value 5458404.145511
## iter 840 value 5439809.048920
## iter 850 value 5428243.762038
## iter 860 value 5418655.780232
## iter 870 value 5405152.459876
## iter 880 value 5383756.803408
## iter 890 value 5367986.456694
## iter 900 value 5356639.400049
## iter 910 value 5355278.764323
## iter 920 value 5354075.525070
## iter 930 value 5352585.843010
## iter 940 value 5350979.801444
## iter 950 value 5347499.321808
## iter 960 value 5345138.914400
## iter 970 value 5342260.768347
## iter 980 value 5339830.648057
## iter 990 value 5336918.440743
## iter1000 value 5334148.967456
## iter1010 value 5326408.457340
## iter1020 value 5315954.564087
## iter1030 value 5306384.458862
## iter1040 value 5289363.891101
## iter1050 value 5275356.847291
## iter1060 value 5257787.628485
## iter1070 value 5237864.748624
## iter1080 value 5217583.398910
## iter1090 value 5198233.100384
## iter1100 value 5182145.962776
## iter1110 value 5154952.811982
## iter1120 value 5046426.612458
## iter1130 value 4932740.127228
## iter1140 value 4824955.129080
## iter1150 value 4738758.948503
## iter1160 value 4702028.300774
## iter1170 value 4696180.504711
## iter1180 value 4693908.455622
## iter1190 value 4691043.545441
## iter1200 value 4687841.079521
## iter1210 value 4683378.957810
## iter1220 value 4678813.020215
## iter1230 value 4675174.869844
## iter1240 value 4674060.899821
## iter1250 value 4673165.827489
## iter1260 value 4671326.817299
## iter1270 value 4669375.323671
## iter1280 value 4663171.096827
## iter1290 value 4654551.439058
## iter1300 value 4646095.504591
## iter1310 value 4637712.413318
## iter1320 value 4628697.185983
## iter1330 value 4617862.632218
## iter1340 value 4612940.613412
## iter1350 value 4610200.109483
## iter1360 value 4606639.863775
## iter1370 value 4600348.262175
## iter1380 value 4594638.440412
## iter1390 value 4591150.290138
## iter1400 value 4583073.795355
## iter1410 value 4577131.122402
## iter1420 value 4573495.593067
## iter1430 value 4570381.987990
## iter1440 value 4566368.716089
## iter1450 value 4558438.118651
## iter1460 value 4550286.770402
## iter1470 value 4532598.748749
## iter1480 value 4508059.556597
## iter1490 value 4486434.875776
## iter1500 value 4472675.878678
## iter1510 value 4456478.563183
## iter1520 value 4439355.428547
## iter1530 value 4428077.213840
## iter1540 value 4421709.801099
## iter1550 value 4416947.066238
## iter1560 value 4405416.497475
## iter1570 value 4392062.605773
## iter1580 value 4388203.374474
## iter1590 value 4380868.238976
## iter1600 value 4366799.163906
## iter1610 value 4346919.131242
## iter1620 value 4297195.325302
## iter1630 value 4271397.846995
## iter1640 value 4253210.526550
## iter1650 value 4239065.324885
## iter1660 value 4217305.314791
## iter1670 value 4192273.745466
## iter1680 value 4151889.563624
## iter1690 value 4123157.498042
## iter1700 value 4095028.148080
## iter1710 value 4035238.021968
## iter1720 value 3992999.191109
## iter1730 value 3970510.993959
## iter1740 value 3932532.807414
## iter1750 value 3897845.395836
## iter1760 value 3861586.731904
## iter1770 value 3843931.420332
## iter1780 value 3830398.356272
## iter1790 value 3806038.422886
## iter1800 value 3785337.747438
## iter1810 value 3760265.768530
## iter1820 value 3717314.728178
## iter1830 value 3672922.201778
## iter1840 value 3625946.893495
## iter1850 value 3602968.907786
## iter1860 value 3589536.083877
## iter1870 value 3582753.202218
## iter1880 value 3578915.257406
## iter1890 value 3575515.365280
## iter1900 value 3571912.962148
## iter1910 value 3568007.022570
## iter1920 value 3567364.045516
## iter1930 value 3567161.605271
## iter1940 value 3566934.511826
## iter1950 value 3566738.883362
## iter1960 value 3566252.015464
## iter1970 value 3566017.462739
## iter1980 value 3565852.184021
## iter1990 value 3565374.082521
## iter2000 value 3564388.395040
## iter2010 value 3563653.305807
## iter2020 value 3562883.827107
## iter2030 value 3562043.488463
## iter2040 value 3561185.717619
## iter2050 value 3559970.334918
## iter2060 value 3556859.054456
## iter2070 value 3555588.064839
## iter2080 value 3554372.427717
## iter2090 value 3553306.409237
## iter2100 value 3551837.316540
## iter2110 value 3549907.652729
## iter2120 value 3548371.512788
## iter2130 value 3547743.949143
## iter2140 value 3547722.889984
## iter2150 value 3547669.683719
## iter2160 value 3547587.739835
## iter2170 value 3547131.516797
## iter2180 value 3546796.286713
## iter2190 value 3546552.656148
## iter2200 value 3546430.121555
## iter2210 value 3546260.465512
## iter2220 value 3546133.806593
## iter2230 value 3546020.245206
## iter2240 value 3545909.981841
## iter2250 value 3545634.930367
## iter2260 value 3545124.426822
## iter2270 value 3544594.718777
## iter2280 value 3544113.534928
## iter2290 value 3543673.997748
## iter2300 value 3543453.766373
## iter2310 value 3543034.459118
## iter2320 value 3542848.820398
## iter2330 value 3542591.284904
## iter2340 value 3542285.605819
## iter2350 value 3541812.875221
## iter2360 value 3540755.926437
## iter2370 value 3540718.661016
## iter2380 value 3540701.363798
## iter2390 value 3540669.926611
## iter2400 value 3540640.164547
## iter2410 value 3540582.251667
## iter2420 value 3540489.122246
## iter2430 value 3540413.392597
## iter2440 value 3540347.432918
## iter2450 value 3540231.031308
## iter2460 value 3540131.003184
## iter2470 value 3540029.111724
## iter2480 value 3539616.736825
## iter2490 value 3539354.765518
## iter2500 value 3538313.566925
## iter2510 value 3536749.758398
## iter2520 value 3533999.161912
## iter2530 value 3530331.015172
## iter2540 value 3524939.230450
## iter2550 value 3517953.502637
## iter2560 value 3512357.195638
## iter2570 value 3507447.419731
## iter2580 value 3503399.636003
## iter2590 value 3499402.823315
## iter2600 value 3493905.373491
## iter2610 value 3485257.536225
## iter2620 value 3381573.082673
## iter2630 value 3338729.402271
## iter2640 value 3319278.378477
## iter2650 value 3312783.660533
## iter2660 value 3308449.528535
## iter2670 value 3305974.192476
## iter2680 value 3304211.855306
## iter2690 value 3302443.503378
## iter2700 value 3302222.051852
## iter2710 value 3302196.472379
## iter2720 value 3302174.422090
## iter2730 value 3302151.639144
## iter2740 value 3302126.430492
## iter2750 value 3302090.369510
## iter2760 value 3302038.471514
## iter2770 value 3301663.035924
## iter2780 value 3301558.993849
## iter2790 value 3301294.115936
## iter2800 value 3300997.944026
## iter2810 value 3300870.844426
## iter2820 value 3300613.753409
## iter2830 value 3300246.587786
## iter2840 value 3299467.844868
## iter2850 value 3297886.437587
## iter2860 value 3293623.791957
## iter2870 value 3286205.065787
## iter2880 value 3265637.528298
## iter2890 value 3243992.810821
## iter2900 value 3227501.168198
## iter2910 value 3218393.830802
## iter2920 value 3210864.220402
## iter2930 value 3202569.686503
## iter2940 value 3202420.326311
## iter2950 value 3202302.524755
## iter2960 value 3202115.267855
## iter2970 value 3201843.076007
## iter2980 value 3201169.375103
## iter2990 value 3197228.319865
## iter3000 value 3194999.428792
## iter3010 value 3188204.333313
## iter3020 value 3181011.432762
## iter3030 value 3179025.636007
## iter3040 value 3178019.670779
## iter3050 value 3177583.769319
## iter3060 value 3176822.964896
## iter3070 value 3175558.555499
## iter3080 value 3173631.111907
## iter3090 value 3172059.226297
## iter3100 value 3170421.397750
## iter3110 value 3169240.054163
## iter3120 value 3167059.362467
## iter3130 value 3161008.498174
## iter3140 value 3149841.981309
## iter3150 value 3141301.991144
## iter3160 value 3134245.039375
## iter3170 value 3128256.032757
## iter3180 value 3123445.809331
## iter3190 value 3121048.679062
## iter3200 value 3119278.300741
## iter3210 value 3116220.723241
## iter3220 value 3114293.384877
## iter3230 value 3111355.900700
## iter3240 value 3108690.595627
## iter3250 value 3095166.395606
## iter3260 value 3084393.555359
## iter3270 value 3078146.158986
## iter3280 value 3074410.156545
## iter3290 value 3072852.548926
## iter3300 value 3071884.910004
## iter3310 value 3071055.342018
## iter3320 value 3070749.495930
## iter3330 value 3070553.423179
## iter3340 value 3070245.699938
## iter3350 value 3069960.212081
## iter3360 value 3069866.049142
## iter3370 value 3069813.339748
## iter3380 value 3069713.364211
## iter3390 value 3069673.844520
## iter3400 value 3069651.867665
## iter3410 value 3069643.365503
## iter3420 value 3069642.374175
## iter3420 value 3069642.348247
## iter3420 value 3069642.348061
## final  value 3069642.348061 
## converged
neural.casual = predict(neural.model.casual, testset, type="raw")
test$neural.casual = neural.casual
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 426.777
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result

test$neural.casual[test$neural.casual < 0] = 0
neural.casual.MSE = sum((test$neural.casual-test$casual)^2)/nrow(test)
neural.casual.MSE
## [1] 412.3482
neural.casual.result = ggplot(test,aes(casual,neural.casual))+geom_point()
neural.casual.result

Step3: Now combine the predicted registered users and casual users

test$neural.combined = test$neural.casual + test$neural.registered
neural.combined.MSE = sum((test$cnt-test$neural.combined)^2)/nrow(test)
neural.combined.MSE
## [1] 4225.822

Factorize data for the rest of models

# Factorization of training data
train$season <- factor(train$season)
train$yr <- factor(train$yr)
train$mnth <- factor(train$mnth)
train$hr <- factor(train$hr)
train$holiday <- factor(train$holiday)
train$weekday<- factor(train$weekday)
train$workingday <- factor(train$workingday)
train$weathersit <- factor(train$weathersit)

# Factorization of test data
test$season <- factor(test$season)
test$yr <- factor(test$yr)
test$mnth <- factor(test$mnth)
test$hr <- factor(test$hr)
test$holiday <- factor(test$holiday)
test$weekday<- factor(test$weekday)
test$workingday <- factor(test$workingday)
test$weathersit <- factor(test$weathersit)

(2) Linear Regression

Step1: Orignal model

# Orignal Model
lm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + 
    workingday + weathersit + temp + atemp + hum + windspeed
fit.lm = lm(lm.formula,data=train)
summary(fit.lm)
## 
## Call:
## lm(formula = lm.formula, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -357.98  -60.35   -7.81   50.80  436.42 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -87.367      7.970 -10.962  < 2e-16 ***
## season2       30.622     15.167   2.019 0.043509 *  
## season3        3.532     21.332   0.166 0.868512    
## season4       37.104     14.999   2.474 0.013383 *  
## yr1           87.418      1.859  47.022  < 2e-16 ***
## mnth2         11.189      4.568   2.449 0.014336 *  
## mnth3         29.450      4.939   5.963 2.55e-09 ***
## mnth4         24.124     15.883   1.519 0.128824    
## mnth5         47.844     15.993   2.991 0.002782 ** 
## mnth6         35.138     16.281   2.158 0.030926 *  
## mnth7         35.805     22.074   1.622 0.104820    
## mnth8         49.474     22.012   2.248 0.024623 *  
## mnth9         77.941     21.826   3.571 0.000357 ***
## mnth10        60.024     16.067   3.736 0.000188 ***
## mnth11        42.706     15.780   2.706 0.006812 ** 
## mnth12        38.384     15.071   2.547 0.010883 *  
## hr1          -17.267      6.337  -2.725 0.006441 ** 
## hr2          -27.562      6.357  -4.336 1.47e-05 ***
## hr3          -38.581      6.416  -6.013 1.88e-09 ***
## hr4          -39.780      6.396  -6.219 5.16e-10 ***
## hr5          -24.003      6.364  -3.772 0.000163 ***
## hr6           35.766      6.353   5.630 1.84e-08 ***
## hr7          170.904      6.344  26.938  < 2e-16 ***
## hr8          314.405      6.335  49.630  < 2e-16 ***
## hr9          166.689      6.341  26.287  < 2e-16 ***
## hr10         110.435      6.366  17.348  < 2e-16 ***
## hr11         137.322      6.412  21.417  < 2e-16 ***
## hr12         177.658      6.463  27.490  < 2e-16 ***
## hr13         173.189      6.516  26.580  < 2e-16 ***
## hr14         156.575      6.558  23.876  < 2e-16 ***
## hr15         167.026      6.569  25.426  < 2e-16 ***
## hr16         230.234      6.554  35.127  < 2e-16 ***
## hr17         387.221      6.517  59.413  < 2e-16 ***
## hr18         352.712      6.472  54.495  < 2e-16 ***
## hr19         240.526      6.408  37.538  < 2e-16 ***
## hr20         158.888      6.373  24.931  < 2e-16 ***
## hr21         108.764      6.346  17.138  < 2e-16 ***
## hr22          72.833      6.335  11.497  < 2e-16 ***
## hr23          33.563      6.330   5.302 1.17e-07 ***
## holiday1      -5.153      5.805  -0.888 0.374687    
## weekday1       4.933      3.568   1.382 0.166908    
## weekday2      10.852      3.447   3.149 0.001644 ** 
## weekday3      10.816      3.448   3.137 0.001711 ** 
## weekday4      11.964      3.441   3.477 0.000510 ***
## weekday5      18.493      3.448   5.364 8.30e-08 ***
## weekday6      20.338      3.420   5.946 2.82e-09 ***
## workingday1       NA         NA      NA       NA    
## weathersit2  -11.980      2.258  -5.306 1.14e-07 ***
## weathersit3  -65.555      3.806 -17.223  < 2e-16 ***
## weathersit4  -51.508     71.197  -0.723 0.469414    
## temp         108.484     33.430   3.245 0.001177 ** 
## atemp        110.132     34.130   3.227 0.001255 ** 
## hum          -80.389      6.680 -12.035  < 2e-16 ***
## windspeed    -33.401      8.380  -3.986 6.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 100.4 on 11981 degrees of freedom
## Multiple R-squared:  0.6942, Adjusted R-squared:  0.6929 
## F-statistic: 523.1 on 52 and 11981 DF,  p-value: < 2.2e-16
lm.cnt=predict(fit.lm, test)
test$lm.cnt = lm.cnt
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 11297.42
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result

test$lm.cnt[test$lm.cnt < 0] = 0
lm.MSE = sum((test$cnt - test$lm.cnt)^2)/nrow(test)
lm.MSE
## [1] 10738.16
lm.result = ggplot(test,aes(cnt,lm.cnt))+geom_point()
lm.result

Step2: Sepeate models for registered and casual

# Model for Registered
# Windspeed and atemp are taken out
lm.registered.formula = registered ~ season + yr + mnth + hr + holiday  + 
    workingday + weathersit + temp + atemp + hum
lm.model.registered = lm(lm.registered.formula,data=train)
lm.registered = predict(lm.model.registered,test)
test$lm.registered = lm.registered
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 8157.388
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result

test$lm.registered[test$lm.registered < 0] = 0
lm.registered.MSE = sum((test$lm.registered-test$registered)^2)/nrow(test)
lm.registered.MSE
## [1] 7763.378
lm.registered.result = ggplot(test,aes(registered,lm.registered))+geom_point()
lm.registered.result

# Model for Casual
# Windspeed is taken out
lm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum 
lm.model.casual = lm(lm.casual.formula, data = train)
lm.casual = predict(lm.model.casual,test)
test$lm.casual = lm.casual
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 998.3204
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result

test$lm.casual[test$lm.casual < 0] = 0
lm.casual.MSE = sum((test$lm.casual-test$casual)^2)/nrow(test)
lm.casual.MSE
## [1] 913.7805
lm.casual.result = ggplot(test,aes(casual,lm.casual))+geom_point()
lm.casual.result

Step3: Now combine the predicted registered users and casual users

test$lm.combined = test$lm.casual + test$lm.registered
lm.combined.MSE = sum((test$cnt-test$lm.combined)^2)/nrow(test)
lm.combined.MSE
## [1] 10684.49

(3) SVM

Step1: Orignal model

# Orignal Model 
svm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model = svm(svm.formula, data = train)
svm.cnt=predict(svm.model, test)
test$svm.cnt = svm.cnt
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7546.31
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result

test$svm.cnt[test$svm.cnt < 0] = 0
svm.MSE = sum((test$cnt - test$svm.cnt)^2)/nrow(test)
svm.MSE
## [1] 7511.831
svm.result = ggplot(test,aes(cnt,svm.cnt))+geom_point()
svm.result

Step2: Sepeate models for registered and casual

# Model for Registered
svm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday + temp + atemp + hum 
svm.model.registered = svm(svm.registered.formula, data = train)
svm.registered = predict(svm.model.registered,test)
test$svm.registered = svm.registered
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5870.63
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result

test$svm.registered[test$svm.registered < 0] = 0
svm.registered.MSE = sum((test$svm.registered-test$registered)^2)/nrow(test)
svm.registered.MSE
## [1] 5854.001
svm.registered.result = ggplot(test,aes(registered,svm.registered))+geom_point()
svm.registered.result

# Model for Casual
svm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
svm.model.casual = svm(svm.casual.formula, data = train)
svm.casual = predict(svm.model.casual,test)
test$svm.casual = svm.casual
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 650.8312
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result

test$svm.casual[test$svm.casual < 0] = 0
svm.casual.MSE = sum((test$svm.casual-test$casual)^2)/nrow(test)
svm.casual.MSE
## [1] 647.9083
svm.casual.result = ggplot(test,aes(casual,svm.casual))+geom_point()
svm.casual.result

Step3: Now combine the predicted registered users and casual users

test$svm.combined = test$svm.casual + test$svm.registered
svm.combined.MSE = sum((test$cnt-test$svm.combined)^2)/nrow(test)
svm.combined.MSE
## [1] 7655.972

(4) Random Forest

Step1: Orignal model

# Orignal Model 
forest.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model = randomForest(forest.formula, data = train, importance = TRUE, ntree = 200)
forest.model$importance
##                %IncMSE IncNodePurity
## season      1871.34242     8635705.9
## yr          6088.63993    29990562.8
## mnth        3312.97212    19240883.7
## hr         37931.20654   207515805.3
## holiday       84.49893      821640.8
## weekday     3881.31674    17465827.4
## workingday  4331.12865    12914924.0
## weathersit   901.14050     6197386.3
## temp        4655.11282    25487693.4
## atemp       4356.66207    29818990.7
## hum         3598.02802    24836117.6
## windspeed    354.96422     6505362.1
forest.cnt = predict(forest.model,test)
test$forest.cnt = forest.cnt
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3880.79
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result

test$forest.cnt[test$forest.cnt < 0] = 0
forest.MSE = sum((test$cnt - test$forest.cnt)^2)/nrow(test)
forest.MSE
## [1] 3880.79
forest.result = ggplot(test,aes(cnt,forest.cnt))+geom_point()
forest.result

Step2: Sepeate models for registered and casual

# Model for Registered
forest.registered.formula = registered ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.registered = randomForest(forest.registered.formula, data = train, importance = TRUE, ntree = 200)
forest.model.registered$importance
##                %IncMSE IncNodePurity
## season      1478.23494     6764863.7
## yr          4685.71861    22117971.9
## mnth        2069.56038    11215853.4
## hr         29707.85995   155209963.3
## holiday       74.91226      616967.2
## weekday     3304.74832    14854136.8
## workingday  3934.06460    12417113.8
## weathersit   623.19101     4380534.2
## temp        2196.23971    11315541.4
## atemp       2270.60299    13667475.3
## hum         1837.34149    12978243.7
## windspeed    226.53556     4097693.1
forest.registered = predict(forest.model.registered,test)
test$forest.registered = forest.registered
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2621.8
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result

test$forest.registered[test$forest.registered < 0] = 0
forest.registered.MSE = sum((test$forest.registered-test$registered)^2)/nrow(test)
forest.registered.MSE
## [1] 2621.8
forest.registered.result = ggplot(test,aes(registered,forest.registered))+geom_point()
forest.registered.result

# Model for Casual
forest.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
forest.model.casual = randomForest(forest.casual.formula, data = train, importance = TRUE, ntree = 160)
forest.model.casual$importance
##               %IncMSE IncNodePurity
## season      182.16943      623588.3
## yr          236.50493      986376.2
## mnth        320.39413     1617992.7
## hr         1877.97551     9801297.7
## holiday      25.03605      170788.4
## weekday     768.45470     3078717.6
## workingday  813.36772     3066046.2
## weathersit   49.35926      322439.7
## temp        637.40638     3412538.3
## atemp       612.23397     3852357.0
## hum         293.67728     2120674.9
## windspeed    33.82769      514062.8
forest.casual = predict(forest.model.casual,test)
test$forest.casual = forest.casual
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 384.6417
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result

test$forest.casual[test$forest.casual < 0] = 0
forest.casual.MSE = sum((test$forest.casual-test$casual)^2)/nrow(test)
forest.casual.MSE
## [1] 384.6417
forest.casual.result = ggplot(test,aes(casual,forest.casual))+geom_point()
forest.casual.result

Step3: Now combine the predicted registered users and casual users

test$forest.combined = test$forest.casual + test$forest.registered
forest.combined.MSE = sum((test$cnt-test$forest.combined)^2)/nrow(test)
forest.combined.MSE
## [1] 3525.916

(5) GBM

Step1: Orignal model

# Orignal Model 
gbm.formula = cnt ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model = gbm(gbm.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model)

##                   var    rel.inf
## hr                 hr 60.2734626
## yr                 yr  9.2049903
## workingday workingday  7.2880830
## temp             temp  6.2409444
## mnth             mnth  4.9286799
## weekday       weekday  3.6216000
## atemp           atemp  2.6415323
## season         season  1.9275302
## weathersit weathersit  1.8871193
## hum               hum  1.5486118
## windspeed   windspeed  0.3005041
## holiday       holiday  0.1369422
pef.trees = gbm.perf(gbm.model)
## Using OOB method...

gbm.cnt = predict(gbm.model, newdata=test, n.trees=pef.trees)
test$gbm.cnt = gbm.cnt
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3518.01
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result

test$gbm.cnt[test$gbm.cnt < 0] = 0
gbm.MSE = sum((test$cnt - test$gbm.cnt)^2)/nrow(test)
gbm.MSE
## [1] 3506.468
gbm.result = ggplot(test,aes(cnt,gbm.cnt))+geom_point()
gbm.result

Step2: Sepeate models for registered and casual

# Model for Registered
# Take off holiday, weedspeed and hum give the best result
gbm.registered.formula = registered ~ season + yr + mnth + hr + weekday + workingday+ weathersit + temp
gbm.model.registered = gbm(gbm.registered.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.registered)

##                   var   rel.inf
## hr                 hr 59.236631
## workingday workingday 11.068316
## yr                 yr 10.799411
## mnth             mnth  6.255357
## weekday       weekday  4.714831
## temp             temp  3.322063
## weathersit weathersit  2.817562
## season         season  1.785829
perf.trees.registered = gbm.perf(gbm.model.registered)
## Using OOB method...

gbm.registered = predict(gbm.model.registered,newdata=test,n.trees = perf.trees.registered)
test$gbm.registered = gbm.registered
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2565.314
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result

test$gbm.registered[test$gbm.registered < 0] = 0
gbm.registered.MSE = sum((test$gbm.registered-test$registered)^2)/nrow(test)
gbm.registered.MSE
## [1] 2561.061
gbm.registered.result = ggplot(test,aes(registered,gbm.registered))+geom_point()
gbm.registered.result

# Model for Casual
gbm.casual.formula = casual ~ season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed
gbm.model.casual = gbm(gbm.casual.formula, data=train, n.trees=1000, distribution="gaussian", interaction.depth=5, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)
summary(gbm.model.casual)

##                   var    rel.inf
## hr                 hr 34.8206199
## temp             temp 14.7735903
## workingday workingday 13.2366269
## weekday       weekday 11.1743710
## atemp           atemp  8.5829810
## mnth             mnth  7.2801053
## yr                 yr  4.2087383
## hum               hum  3.8232396
## windspeed   windspeed  0.8347149
## weathersit weathersit  0.7122874
## holiday       holiday  0.3983422
## season         season  0.1543832
perf.trees.casual = gbm.perf(gbm.model.casual)
## Using OOB method...

gbm.casual = predict(gbm.model.casual,newdata=test,n.trees = perf.trees.casual)
test$gbm.casual = gbm.casual
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 355.4819
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result

test$gbm.casual[test$gbm.casual < 0] = 0
gbm.casual.MSE = sum((test$gbm.casual-test$casual)^2)/nrow(test)
gbm.casual.MSE
## [1] 354.4988
gbm.casual.result = ggplot(test,aes(casual,gbm.casual))+geom_point()
gbm.casual.result

Step3: Now combine the predicted registered users and casual users

test$gbm.combined = test$gbm.casual + test$gbm.registered
gbm.combined.MSE = sum((test$cnt-test$gbm.combined)^2)/nrow(test)
gbm.combined.MSE
## [1] 3349.674

(6) Regression Tree

Step1: Original model

# Orignal Model 
formula.cnt <- cnt ~  season + yr + mnth + hr + holiday + weekday + workingday+ weathersit + temp + atemp + hum + windspeed

fit.rpart.cnt <- rpart(formula.cnt, method="anova", data= train)
print(summary(fit.rpart.cnt))
## Call:
## rpart(formula = formula.cnt, data = train, method = "anova")
##   n= 12034 
## 
##            CP nsplit rel error    xerror        xstd
## 1  0.36074746      0 1.0000000 1.0000552 0.016575310
## 2  0.10238796      1 0.6392525 0.6393086 0.012201926
## 3  0.05755831      2 0.5368646 0.5403609 0.009621364
## 4  0.03971267      3 0.4793063 0.4725795 0.008129473
## 5  0.03050477      4 0.4395936 0.4311524 0.007441543
## 6  0.02638756      5 0.4090888 0.4042151 0.006714492
## 7  0.02181266      6 0.3827013 0.3780289 0.006383340
## 8  0.02006172      7 0.3608886 0.3598845 0.006252274
## 9  0.01945226      8 0.3408269 0.3349020 0.005815467
## 10 0.01611249     10 0.3019223 0.3109133 0.005343486
## 11 0.01412510     11 0.2858099 0.2863092 0.005100399
## 12 0.01332590     12 0.2716848 0.2691068 0.005087445
## 13 0.01123726     13 0.2583589 0.2600968 0.004951488
## 14 0.01000000     14 0.2471216 0.2517729 0.004816167
## 
## Variable importance
##         hr         yr       mnth       temp      atemp        hum 
##         51          9          7          7          7          5 
##     season workingday    weekday  windspeed 
##          5          4          4          1 
## 
## Node number 1: 12034 observations,    complexity param=0.3607475
##   mean=191.5313, MSE=32814.64 
##   left son=2 (4479 obs) right son=3 (7555 obs)
##   Primary splits:
##       hr    splits as  LLLLLLLRRRRRRRRRRRRRRRLL, improve=0.36074750, (0 missing)
##       atemp < 0.5985  to the left,  improve=0.12935650, (0 missing)
##       temp  < 0.55    to the left,  improve=0.10419890, (0 missing)
##       hum   < 0.665   to the right, improve=0.07209467, (0 missing)
##       yr    splits as  LR, improve=0.06667235, (0 missing)
##   Surrogate splits:
##       hum       < 0.765   to the right, agree=0.663, adj=0.094, (0 split)
##       windspeed < 0.09705 to the left,  agree=0.632, adj=0.013, (0 split)
##       temp      < 0.15    to the left,  agree=0.630, adj=0.007, (0 split)
##       atemp     < 0.08335 to the left,  agree=0.629, adj=0.003, (0 split)
## 
## Node number 2: 4479 observations,    complexity param=0.0141251
##   mean=50.22483, MSE=3239.629 
##   left son=4 (2968 obs) right son=5 (1511 obs)
##   Primary splits:
##       hr     splits as  LLLLLLR---------------RR, improve=0.38440840, (0 missing)
##       temp   < 0.47    to the left,  improve=0.06934414, (0 missing)
##       atemp  < 0.4621  to the left,  improve=0.06834863, (0 missing)
##       mnth   splits as  LLLLRRRRRRLL, improve=0.05825211, (0 missing)
##       season splits as  LRRR, improve=0.05093041, (0 missing)
##   Surrogate splits:
##       temp  < 0.77    to the left,  agree=0.666, adj=0.009, (0 split)
##       atemp < 0.73485 to the left,  agree=0.665, adj=0.008, (0 split)
##       hum   < 0.11    to the right, agree=0.663, adj=0.001, (0 split)
## 
## Node number 3: 7555 observations,    complexity param=0.102388
##   mean=275.3052, MSE=31492.39 
##   left son=6 (5036 obs) right son=7 (2519 obs)
##   Primary splits:
##       hr     splits as  -------LRLLLLLLLRRRRLL--, improve=0.1699364, (0 missing)
##       yr     splits as  LR, improve=0.1505464, (0 missing)
##       temp   < 0.49    to the left,  improve=0.1391370, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.1364355, (0 missing)
##       season splits as  LRRR, improve=0.1233174, (0 missing)
##   Surrogate splits:
##       windspeed  < 0.7985  to the left,  agree=0.667, adj=0.001, (0 split)
##       weathersit splits as  LLLR,        agree=0.667, adj=0.000, (0 split)
##       atemp      < 0.9015  to the left,  agree=0.667, adj=0.000, (0 split)
## 
## Node number 4: 2968 observations
##   mean=25.04549, MSE=933.8123 
## 
## Node number 5: 1511 observations
##   mean=99.68365, MSE=4077.341 
## 
## Node number 6: 5036 observations,    complexity param=0.03971267
##   mean=223.5663, MSE=17665.7 
##   left son=12 (2516 obs) right son=13 (2520 obs)
##   Primary splits:
##       yr     splits as  LR,           improve=0.1762747, (0 missing)
##       temp   < 0.45    to the left,   improve=0.1504354, (0 missing)
##       atemp  < 0.44695 to the left,   improve=0.1476129, (0 missing)
##       season splits as  LRRR,         improve=0.1474771, (0 missing)
##       mnth   splits as  LLLRRRRRRRRR, improve=0.1461857, (0 missing)
##   Surrogate splits:
##       hum        < 0.505   to the right, agree=0.535, adj=0.070, (0 split)
##       temp       < 0.55    to the left,  agree=0.529, adj=0.057, (0 split)
##       atemp      < 0.52275 to the left,  agree=0.528, adj=0.056, (0 split)
##       windspeed  < 0.26865 to the right, agree=0.527, adj=0.053, (0 split)
##       weathersit splits as  RLL-,        agree=0.512, adj=0.024, (0 split)
## 
## Node number 7: 2519 observations,    complexity param=0.05755831
##   mean=378.742, MSE=43083.91 
##   left son=14 (1259 obs) right son=15 (1260 obs)
##   Primary splits:
##       yr     splits as  LR,           improve=0.2094317, (0 missing)
##       temp   < 0.49    to the left,   improve=0.1993776, (0 missing)
##       atemp  < 0.47725 to the left,   improve=0.1949973, (0 missing)
##       season splits as  LRRR,         improve=0.1673958, (0 missing)
##       mnth   splits as  LLLRRRRRRRRR, improve=0.1626678, (0 missing)
##   Surrogate splits:
##       hum        < 0.445   to the right, agree=0.554, adj=0.107, (0 split)
##       atemp      < 0.58335 to the left,  agree=0.533, adj=0.066, (0 split)
##       temp       < 0.59    to the left,  agree=0.532, adj=0.064, (0 split)
##       windspeed  < 0.31345 to the right, agree=0.515, adj=0.030, (0 split)
##       weathersit splits as  RRLR,        agree=0.509, adj=0.018, (0 split)
## 
## Node number 12: 2516 observations,    complexity param=0.01611249
##   mean=167.7186, MSE=9589.757 
##   left son=24 (836 obs) right son=25 (1680 obs)
##   Primary splits:
##       mnth       splits as  LLLLRRRRRRRR, improve=0.2637072, (0 missing)
##       season     splits as  LRRR,         improve=0.2394119, (0 missing)
##       atemp      < 0.4621  to the left,   improve=0.2192312, (0 missing)
##       temp       < 0.47    to the left,   improve=0.2192312, (0 missing)
##       workingday splits as  RL,           improve=0.0625818, (0 missing)
##   Surrogate splits:
##       season    splits as  LRRR,        agree=0.909, adj=0.725, (0 split)
##       atemp     < 0.35605 to the left,  agree=0.793, adj=0.378, (0 split)
##       temp      < 0.39    to the left,  agree=0.791, adj=0.372, (0 split)
##       hum       < 0.405   to the left,  agree=0.688, adj=0.060, (0 split)
##       windspeed < 0.37315 to the right, agree=0.687, adj=0.057, (0 split)
## 
## Node number 13: 2520 observations,    complexity param=0.02006172
##   mean=279.3254, MSE=19505.74 
##   left son=26 (721 obs) right son=27 (1799 obs)
##   Primary splits:
##       temp    < 0.39    to the left,   improve=0.1611695, (0 missing)
##       atemp   < 0.4015  to the left,   improve=0.1571376, (0 missing)
##       mnth    splits as  LLRRRRRRRRRR, improve=0.1559454, (0 missing)
##       season  splits as  LRRR,         improve=0.1512278, (0 missing)
##       weekday splits as  RLLLLLR,      improve=0.0617536, (0 missing)
##   Surrogate splits:
##       atemp  < 0.4015  to the left,   agree=0.996, adj=0.986, (0 split)
##       mnth   splits as  LLRRRRRRRRLL, agree=0.873, adj=0.558, (0 split)
##       season splits as  LRRR,         agree=0.803, adj=0.311, (0 split)
##       hum    < 0.91    to the right,  agree=0.715, adj=0.004, (0 split)
## 
## Node number 14: 1259 observations,    complexity param=0.02181266
##   mean=283.7141, MSE=22177.25 
##   left son=28 (524 obs) right son=29 (735 obs)
##   Primary splits:
##       mnth       splits as  LLLLRRRRRRRL, improve=0.3084984, (0 missing)
##       season     splits as  LRRR,         improve=0.3003280, (0 missing)
##       temp       < 0.47    to the left,   improve=0.2964665, (0 missing)
##       atemp      < 0.4621  to the left,   improve=0.2964665, (0 missing)
##       workingday splits as  LR,           improve=0.1184052, (0 missing)
##   Surrogate splits:
##       temp      < 0.47    to the left,  agree=0.841, adj=0.618, (0 split)
##       atemp     < 0.4621  to the left,  agree=0.841, adj=0.618, (0 split)
##       season    splits as  LRRR,        agree=0.833, adj=0.599, (0 split)
##       hum       < 0.385   to the left,  agree=0.626, adj=0.101, (0 split)
##       windspeed < 0.31345 to the right, agree=0.622, adj=0.092, (0 split)
## 
## Node number 15: 1260 observations,    complexity param=0.03050477
##   mean=473.6944, MSE=45934.88 
##   left son=30 (400 obs) right son=31 (860 obs)
##   Primary splits:
##       workingday splits as  LR,           improve=0.2081288, (0 missing)
##       temp       < 0.43    to the left,   improve=0.1949022, (0 missing)
##       atemp      < 0.4318  to the left,   improve=0.1887979, (0 missing)
##       mnth       splits as  LLRRRRRRRRRR, improve=0.1842011, (0 missing)
##       weekday    splits as  LRRRRRL,      improve=0.1820905, (0 missing)
##   Surrogate splits:
##       weekday splits as  LRRRRRL,     agree=0.968, adj=0.900, (0 split)
##       holiday splits as  RL,          agree=0.714, adj=0.100, (0 split)
##       atemp   < 0.1894  to the left,  agree=0.689, adj=0.020, (0 split)
##       hum     < 0.195   to the left,  agree=0.686, adj=0.010, (0 split)
##       temp    < 0.93    to the right, agree=0.685, adj=0.007, (0 split)
## 
## Node number 24: 836 observations
##   mean=96.43062, MSE=3492.994 
## 
## Node number 25: 1680 observations
##   mean=203.1929, MSE=8836.312 
## 
## Node number 26: 721 observations
##   mean=190.7587, MSE=9708.069 
## 
## Node number 27: 1799 observations,    complexity param=0.01945226
##   mean=314.821, MSE=19028.77 
##   left son=54 (1240 obs) right son=55 (559 obs)
##   Primary splits:
##       workingday splits as  RL, improve=0.13644570, (0 missing)
##       weekday    splits as  RLLLLLR, improve=0.13176200, (0 missing)
##       hr         splits as  -------R-RLRRRRR----RL--, improve=0.05825584, (0 missing)
##       weathersit splits as  RRL-, improve=0.05455847, (0 missing)
##       atemp      < 0.61365 to the left,  improve=0.05099518, (0 missing)
##   Surrogate splits:
##       weekday splits as  RLLLLLR,     agree=0.974, adj=0.916, (0 split)
##       holiday splits as  LR,          agree=0.715, adj=0.084, (0 split)
##       atemp   < 0.85605 to the left,  agree=0.694, adj=0.016, (0 split)
##       temp    < 0.95    to the left,  agree=0.692, adj=0.009, (0 split)
##       hum     < 0.205   to the right, agree=0.690, adj=0.004, (0 split)
## 
## Node number 28: 524 observations
##   mean=185.7519, MSE=11624.71 
## 
## Node number 29: 735 observations
##   mean=353.5537, MSE=17981.19 
## 
## Node number 30: 400 observations,    complexity param=0.01123726
##   mean=330.325, MSE=32555.74 
##   left son=60 (164 obs) right son=61 (236 obs)
##   Primary splits:
##       temp   < 0.49    to the left,  improve=0.3407615, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.3407615, (0 missing)
##       hr     splits as  --------L-------RRRL----, improve=0.2724685, (0 missing)
##       mnth   splits as  LLRRRRRRRRRL, improve=0.2545665, (0 missing)
##       season splits as  LRRR, improve=0.1791564, (0 missing)
##   Surrogate splits:
##       atemp      < 0.47725 to the left,   agree=1.000, adj=1.000, (0 split)
##       season     splits as  LRRL,         agree=0.878, adj=0.701, (0 split)
##       mnth       splits as  LLRRRRRRRLLL, agree=0.878, adj=0.701, (0 split)
##       hum        < 0.745   to the right,  agree=0.638, adj=0.116, (0 split)
##       weathersit splits as  RRL-,         agree=0.612, adj=0.055, (0 split)
## 
## Node number 31: 860 observations,    complexity param=0.02638756
##   mean=540.3779, MSE=38150.68 
##   left son=62 (344 obs) right son=63 (516 obs)
##   Primary splits:
##       hr     splits as  --------R-------LRRL----, improve=0.3175968, (0 missing)
##       mnth   splits as  LLLRRRRRRRLL, improve=0.2233810, (0 missing)
##       season splits as  LRRR, improve=0.2183222, (0 missing)
##       temp   < 0.49    to the left,  improve=0.1973805, (0 missing)
##       atemp  < 0.47725 to the left,  improve=0.1920992, (0 missing)
##   Surrogate splits:
##       temp      < 0.87    to the right, agree=0.602, adj=0.006, (0 split)
##       atemp     < 0.75    to the right, agree=0.602, adj=0.006, (0 split)
##       hum       < 0.215   to the left,  agree=0.601, adj=0.003, (0 split)
##       windspeed < 0.56715 to the right, agree=0.601, adj=0.003, (0 split)
## 
## Node number 54: 1240 observations
##   mean=280.6089, MSE=8820.401 
## 
## Node number 55: 559 observations,    complexity param=0.01945226
##   mean=390.712, MSE=33317.61 
##   left son=110 (204 obs) right son=111 (355 obs)
##   Primary splits:
##       hr        splits as  -------L-LRRRRRR----LL--, improve=0.57408930, (0 missing)
##       hum       < 0.575   to the right, improve=0.24107720, (0 missing)
##       atemp     < 0.61365 to the left,  improve=0.10204650, (0 missing)
##       temp      < 0.55    to the left,  improve=0.06584068, (0 missing)
##       windspeed < 0.31345 to the left,  improve=0.04022250, (0 missing)
##   Surrogate splits:
##       hum < 0.715   to the right, agree=0.696, adj=0.167, (0 split)
## 
## Node number 60: 164 observations
##   mean=203.9756, MSE=14611.74 
## 
## Node number 61: 236 observations
##   mean=418.1271, MSE=26222.35 
## 
## Node number 62: 344 observations
##   mean=405.564, MSE=17641.27 
## 
## Node number 63: 516 observations,    complexity param=0.0133259
##   mean=630.2539, MSE=31629.39 
##   left son=126 (213 obs) right son=127 (303 obs)
##   Primary splits:
##       mnth       splits as  LLLRRRRRRRLL, improve=0.3224286, (0 missing)
##       temp       < 0.49    to the left,   improve=0.3217399, (0 missing)
##       season     splits as  LRRR,         improve=0.3196353, (0 missing)
##       atemp      < 0.47725 to the left,   improve=0.3146083, (0 missing)
##       weathersit splits as  RRLL,         improve=0.1475443, (0 missing)
##   Surrogate splits:
##       season    splits as  LRRL,        agree=0.913, adj=0.789, (0 split)
##       temp      < 0.49    to the left,  agree=0.888, adj=0.728, (0 split)
##       atemp     < 0.47725 to the left,  agree=0.882, adj=0.714, (0 split)
##       hum       < 0.92    to the right, agree=0.597, adj=0.023, (0 split)
##       windspeed < 0.43285 to the right, agree=0.597, adj=0.023, (0 split)
## 
## Node number 110: 204 observations
##   mean=208.2696, MSE=11069.36 
## 
## Node number 111: 355 observations
##   mean=495.5521, MSE=15983.78 
## 
## Node number 126: 213 observations
##   mean=509.8075, MSE=21483.6 
## 
## Node number 127: 303 observations
##   mean=714.9241, MSE=21394.31 
## 
## n= 12034 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 12034 394891300 191.53130  
##     2) hr=0,1,2,3,4,5,6,22,23 4479  14510300  50.22483  
##       4) hr=0,1,2,3,4,5 2968   2771555  25.04549 *
##       5) hr=6,22,23 1511   6160863  99.68365 *
##     3) hr=7,8,9,10,11,12,13,14,15,16,17,18,19,20,21 7555 237925000 275.30520  
##       6) hr=7,9,10,11,12,13,14,15,20,21 5036  88964490 223.56630  
##        12) yr=0 2516  24127830 167.71860  
##          24) mnth=1,2,3,4 836   2920143  96.43062 *
##          25) mnth=5,6,7,8,9,10,11,12 1680  14845000 203.19290 *
##        13) yr=1 2520  49154470 279.32540  
##          26) temp< 0.39 721   6999518 190.75870 *
##          27) temp>=0.39 1799  34232750 314.82100  
##            54) workingday=1 1240  10937300 280.60890 *
##            55) workingday=0 559  18624540 390.71200  
##             110) hr=7,9,20,21 204   2258150 208.26960 *
##             111) hr=10,11,12,13,14,15 355   5674242 495.55210 *
##       7) hr=8,16,17,18,19 2519 108528400 378.74200  
##        14) yr=0 1259  27921150 283.71410  
##          28) mnth=1,2,3,4,12 524   6091350 185.75190 *
##          29) mnth=5,6,7,8,9,10,11 735  13216170 353.55370 *
##        15) yr=1 1260  57877950 473.69440  
##          30) workingday=0 400  13022300 330.32500  
##            60) temp< 0.49 164   2396326 203.97560 *
##            61) temp>=0.49 236   6188474 418.12710 *
##          31) workingday=1 860  32809580 540.37790  
##            62) hr=16,19 344   6068599 405.56400 *
##            63) hr=8,17,18 516  16320770 630.25390  
##             126) mnth=1,2,3,11,12 213   4576007 509.80750 *
##             127) mnth=4,5,6,7,8,9,10 303   6482477 714.92410 *
# Access significant  vairables
fit.rpart.cnt$variable.importance
##         hr         yr       mnth       temp      atemp        hum 
##  209578416   38411469   27767349   27500369   27021271   20736191 
##     season workingday    weekday  windspeed    holiday weathersit 
##   19497245   16716978   15119647    4642626    1597331    1048780
# Validate the fit.rpart model using testing data 
test.cnt <- test[, "cnt"]
test.x <- test[, 3:14]
rpart.cnt <- predict(fit.rpart.cnt, test.x)
test$rpart.cnt <- rpart.cnt
rpart.MSE = mean((rpart.cnt - test.cnt)^2)
rpart.MSE
## [1] 9523.749
rpart.result = ggplot(test,aes(cnt,rpart.cnt))+geom_point()
rpart.result

Step2: Sepeate models for registered and casual

# Model for Registered
formula.registered <- registered ~  season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed
fit.rpart.registered <- rpart(formula.registered, method="anova", data= train)

test.registered <- test[, "registered"]
rpart.registered <- predict(fit.rpart.registered, test.x)
test$rpart.registered <- rpart.registered
rpart.registered.MSE = mean((rpart.registered - test.registered)^2)
rpart.registered.MSE
## [1] 7096.629
rpart.registered.result = ggplot(test,aes(registered,rpart.registered))+geom_point()
rpart.registered.result

# Model for Casual
formula.casual <- casual ~ season + yr + mnth + hr + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed

fit.rpart.casual <-  rpart(formula.casual, method="anova", data= train)

test.casual <- test[, "casual"]
rpart.casual <- predict(fit.rpart.casual, test.x)
test$rpart.casual <- rpart.casual 
rpart.casual.MSE = mean((rpart.casual - test.casual)^2)
rpart.casual.MSE
## [1] 661.3148
rpart.casual.result = ggplot(test,aes(casual,rpart.casual))+geom_point()
rpart.casual.result

Step3: Now combine the predicted registered users and casual users

rpart.combined = rpart.casual + rpart.registered
test$rpart.combined <- rpart.combined
rpart.combined.MSE = sum((test$cnt-test$rpart.combined)^2)/nrow(test)
rpart.combined.MSE
## [1] 8568.5

Now let’s compare the results of all the models, the comparism table.

rpart.MSEs <- c(rpart.MSE, rpart.combined.MSE, rpart.registered.MSE, rpart.casual.MSE )
rpart.MSEs <- matrix(rpart.MSEs, nrow= 1, ncol=4) 
colnames(rpart.MSEs) <-c("cnt.MSE", "combined.MSE", "registered.MSE", "casual.MSE" )
rownames(rpart.MSEs) <- "rpart.MSEs"
lm.MSEs = c(lm.MSE, lm.combined.MSE, lm.registered.MSE, lm.casual.MSE)
forest.MSEs =  c(forest.MSE, forest.combined.MSE, forest.registered.MSE, forest.casual.MSE )
svm.MSEs = c(svm.MSE, svm.combined.MSE, svm.registered.MSE, svm.casual.MSE)
neural.MSEs = c(neural.MSE, neural.combined.MSE, neural.registered.MSE, neural.casual.MSE)
gbm.MSEs = c(gbm.MSE, gbm.combined.MSE, gbm.registered.MSE, gbm.casual.MSE)
Summary.MSE=rbind(rpart.MSEs, forest.MSEs, svm.MSEs, neural.MSEs, gbm.MSEs, lm.MSEs)
Summary.MSE
##               cnt.MSE combined.MSE registered.MSE casual.MSE
## rpart.MSEs   9523.749     8568.500       7096.629   661.3148
## forest.MSEs  3880.790     3525.916       2621.800   384.6417
## svm.MSEs     7511.831     7655.972       5854.001   647.9083
## neural.MSEs 33104.730     4225.822       3407.284   412.3482
## gbm.MSEs     3506.468     3349.674       2561.061   354.4988
## lm.MSEs     10738.163    10684.486       7763.378   913.7805

As shown above, as of registered users, the random forest provides the best prediction, i.e. lowest MSE, and of casual users, GBM yields the best result. Notice predicted results of random forest and GBM contain unwanted negative values, so we have to manually convert them to 0.

This transformation is crucial because the final output we like to generate is the total number of rentals, which are derived from the predicted registered users (by RandomForest) and predicted casual users (by GBM). Since each output of a model includes negative values, we did transformation before adding those predicted values. Finally, with this additional step, we got the least MSE, which essential would be our ideal model.


Explorations


After knowing the relative influence (from gbm) of “cnt”, “registered”, “casual”, we attemped to visualize the story behind the scene.

Important Variables (Relative Influence) cnt: hr, yr, workingday, temp, mnth, weekday….. registered: hr, workingday, yr, mnth, weekday….. casual: hr, weekday, temp, workingday…..

Because of the similarities among these factors, we subjectively grouped them into 4 different factor groups and visualized the plottings against “cnt” (Total users), “registered” (Registered users), and “casual” (Casual users).

Factor groups:
(1) hr
(2) yr, mnth
(3) working, weekday, holiday
(4) temp, atemp, humidity

In terms of the number of attributes, we couldn’t do all the plottings between each two. Therefore, to be logical, we showed the most significant relationships between x’s(e.g. hr, yr ..) and y’s (e.g. cnt, registered, casual).

# Ratio of 2 types of users 
r.registered <- sum(hour.data$registered) / sum(hour.data$cnt)
r.casual <- (1- r.registered)
print(c("Registered %",r.registered ))
## [1] "Registered %"      "0.811698316173547"
print(c("Causal %", r.casual))
## [1] "Causal %"          "0.188301683826453"

Registerd users majorly account for the rental usage. Because “hr” is the most significant factor, let’s see the initial plot between “hr” and “cnt”.

plot(x= hour.data$hr, y= hour.data$cnt)

From this plot, we assume there exists an interesting pattern - peak period. As we can see that the cnt stick out during 7-9am and 5-7pm, which coincides with the peak periods when people go to work in the morning and when people get off from work in the afternoon. Let’s see the finer grained plots.


Plottings


Since hour is the most significant factor for both registered and casual users, let’s see how the cnt, registered and casual fluctuate with the hour.

As the matter of fact that such rush-hours patterns exist, we divided the hour factor into 5 segments to better visualize and understand how rental number changes for both registered users and casual users with the time of a day.

# Create daypart column, default to "Night"
hour.data$daypart <- "Night"
# 0am -7am: "Early morning"
hour.data$daypart[(hour.data$hr >=0 ) & (hour.data$hr <7 )] <- "Early Morning" 
# 7am- 9am : "Peak Morning"
hour.data$daypart[(hour.data$hr >=7 ) & (hour.data$hr <9 )] <- "Peak Morning"
# 9am- 5pm : "Day"
hour.data$daypart[(hour.data$hr >=9 ) & (hour.data$hr <17 )] <- "Day"
# 5pm- 7pm : "Peak Evenning"
hour.data$daypart[(hour.data$hr >=17 ) & (hour.data$hr <20 )] <- "Peak Evening"

# Factorization
hour.data$daypart <- factor(hour.data$daypart)

(1)Hour

# Count by hour
g.cnt.hr <- ggplot(hour.data, aes(x = hr, y = cnt))
g.cnt.hr + geom_point(aes(color = daypart)) + ggtitle("Total Rental by Hour")

It is shown two peaks during the morning and evening peaking hours from 7am to 9am and from 5pm to 8pm. Let’s break it down into registered and casual users.

# Registered by hour
g.registered.hr <- ggplot(hour.data, aes(x = hr, y = registered))
g.registered.hr + geom_point(aes(color = daypart)) + ggtitle("Registered Rental by Hour")

The peaking hours are even more obvious for registered users. Apprently, many registered users commute to work by rental bikes.

# Casual by hour 
g.casual.hr <- ggplot(hour.data, aes(x = hr, y = casual))
g.casual.hr + geom_point(aes(color = daypart)) + ggtitle("Casual Rental by Hour")

There was little impact of the peaking hour on casual users. It implies that people who commute by rental bikes mostly are the registered users. Also, lots of casual users tend to use the service in the afternoon, which may correlate with temperature or other weather factors (because starting from 11am, it gets hotter). We would examine our hypothesis later, the relationship between temperature and the causal users.

(2)Year & Month

# Monthly total rental fluctuation in two years 
year <- function(x) {
  y = 
    if (x == 0) 2011
    else 2012
  return (y)
} 
hour.data$year <- factor(sapply(hour.data$yr, year))
g.cnt.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(cnt), colour = as.factor(year)))

g.cnt.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Total Rental Over Two Years")

The ridership increased significantly in 2012. Furthermore, since 81% of the users are registered, we assumed that the ridership of registered users went up drastically. Let’s evaluate our assumption as followed.

# Monthly registered rental fluctuation in two years 
g.registered.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(registered), colour = as.factor(year)))

g.registered.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Registered Rental Over Two Years")

Not only did the ridership of registered users increase significantly, there is also an interesting pattern. While the ridership of registered users of the first 7 months increased steadily, it appeared to be a jump from August in 2011. This might result from new policies or other environmental factors.

Notice, usage is generally lower in Winter, which may be related to lower temperature.

# Monthly casual rental fluctuation in two years
g.casual.mnth <- ggplot(hour.data, aes(as.numeric(mnth), as.numeric(casual), colour = as.factor(year)))

g.casual.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Monthly Casual Rental Over Two Years")

There are a lot more casual users in Summer and Fall than Spring and Winter. This fact also explains why casual users are more affected by the environmental settings than registered users.

Recall the previous graph, as compared to casual users, registered users’ usage curve are flatter than causal users’, because registered users who use bikes to commute are using them regularly relatively insensitive to the month.

(3) Working Day, Weekday, Holiday

# Count by hour on workingday 
g.cnt.workday <- ggplot(hour.data, aes(x = hr, y = cnt, fill = as.factor(workingday)))
g.cnt.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Total Rental by Workingday")

1 denotes working day, while 0 denotes non-working day. On working days, the peaking hours are very obvious, while on non-working days, many people use the rental bikes in the afternoon. One way to look at this is that maybe on a non-working day, people like to use the service for fun.

# Registered on workingday
g.registered.workday <- ggplot(hour.data, aes(x = hr, y = registered, fill = as.factor(workingday)))
g.registered.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Registered Rental by Workingday")

On working days, most registered users use rental bikes during peak periods, meaning that, again, most registered users are bike commuters. On non-working days, their usage is less fluctuated. Let’s evaluate:

library(data.table)
## 
## Attaching package: 'data.table'
## The following object is masked _by_ '.GlobalEnv':
## 
##     year
# Subsetting 
sub.hour.data <- hour.data[,c("daypart", "cnt", "registered", "casual")]

# Create data table
dt <-as.data.table(sub.hour.data)

# Extract data where daypart is "Peak Morning" and "Peak Evening " 
dt.peak <- dt[daypart %in% c("Peak Morning", "Peak Evening")]

# 42.3% of registered users use bike in peak hours
dt.peak[, sum(registered)] / dt[, sum(registered)]
## [1] 0.4230142

Notice that 81% of the users are registered and 42.3% of whom used bikes in peak hours, indicating the management needs to pay special attention to peak hours bike arrangement.

# Casual on workingday
g.casual.workday <- ggplot(hour.data, aes(x = hr, y = casual, fill = as.factor(workingday)))
g.casual.workday + geom_bar(stat = "identity", position="dodge") + ggtitle("Casual Rental by Workingday")

Given that most of the users are registered users (81%), of all 19% users are casual users. Unlike registered users, casual users’ pattern on working days is similar to non-working days. Notably, casual users tend to use the services on non-working days, especially in the day time, when human activities are vivid. Or maybe they couldn’t get enough bikes on working days due to their lower priority.

# Count on weekday
g.cnt.hr.byweekday <- ggplot(hour.data, aes(as.numeric(hr), as.numeric(cnt), colour = as.factor(weekday)))
g.cnt.hr.byweekday + geom_smooth(se = FALSE, method = "auto") + ggtitle("Total Rentals vs Hour (from Sunday to Monday)")

Rentals from Monday to Friday falls into one pattern, while rentals from Saturday and Sunday falls into the other. This pattern matches the result of the previous graph, reflecting that the management has to treat the arrangement of bikes on weekdays and weekends very differently.

Let’s evaluate our assumption:

# Check the propotion of registered & casual users on a working or non-working day
sub.hour.data <- hour.data[,c("workingday", "cnt", "registered", "casual")]
dt <-as.data.table(sub.hour.data)
dt.wd1 <- dt[workingday == 1]
dt.wd0 <- dt[workingday == 0]

# On a working day, 87% of users are registered 
dt.wd1[, sum(registered)] /dt.wd1[, sum(cnt)]
## [1] 0.8677004
# While on a non-working day, casual users accounts 32% of total ridership
dt.wd0[, sum(casual)] /dt.wd0[, sum(cnt)]
## [1] 0.3166468

The business insight here is that, when it is a working day, the management should stress on providing the best services for registered users.

Conversely, when it is not, although registered users are still more than casual users, the demand for casual users becomes important, especially from 12- 17pm. On a non-working day, the proportion of causal users increases from 13% on a working day to 32%.

(4) Temp, ATemp, Humidity

Just a picture of how temperature changes in a one-year time frame.

# mnth vs temp
g.temp.mnth <- ggplot(hour.data, aes(as.numeric(mnth), temp))
g.temp.mnth + geom_smooth(se = FALSE, method = "auto") + ggtitle("Temperature fluctuation in an Year")

As disscused above, we assumed that registered users less affected by environmental settings. Let’s see:

# Registered on temp 
g.registered.temp <- ggplot(hour.data, aes(x = temp, y = registered))
g.registered.temp + geom_point()

Without surprise, the number of rental bikes for registered users does not change much according to temperature, which proves our assumption.

# Casual on temp
g.casual.temp <- ggplot(hour.data, aes(x = temp, y = casual))
g.casual.temp + geom_point()

Casual users are more sensitive to temperature than registered users. The usage is much higher between 20 to 30 Celsius degree. Interestingly, casual users find it unbearable when the temperature exceeds 30 degrees, hence, the ridership of casual users dropped greatly.

# Registered/casual on feeled temp
hour.data$raw.atemp <- hour.data$atemp*50
g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = registered))
g.temp2 + geom_point()

g.temp2 <- ggplot(hour.data, aes(x = raw.atemp, y = casual))
g.temp2 + geom_point()

The atemp plot is similar to the temp plot.

# Registered/casual on humidity
g.registered.hum <- ggplot(hour.data, aes(x = hum, y = registered))
g.registered.hum + geom_point()

g.casual.hum <- ggplot(hour.data, aes(x = hum, y = casual))
g.casual.hum + geom_point()

Casual users are more sensitive to humidity than registered users, but the casual usage is also kind of smooth except extreme humidity (e.g. heavy rain). It indicates that biking activity is not relatively sensitive to humidity.


Reflection


The plotting results have confirmed our assumption that hr, mnth, workingday, temp, hum have the major correlation with cnt/registered/casual. We also have the following interesting findings: 1) Most registered users commute to work by rental bike, while casual users do not. 2) 2012 showed an increase in users from 2011, contributed majorly by registered users. 3) On working days and non-working days, the usage pattern by hour differs a lot. 4) Casual users are more sensitive to weather condition than registered users.

The biking sharing system should allocate bikes considering these facts.


License, Acknowledgement, and References


[1] Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.

@article{ year={2013}, issn={2192-6352}, journal={Progress in Artificial Intelligence}, doi={10.1007/s13748-013-0040-3}, title={Event labeling combining ensemble detectors and background knowledge}, url={http://dx.doi.org/10.1007/s13748-013-0040-3}, publisher={Springer Berlin Heidelberg}, keywords={Event labeling; Event detection; Ensemble learning; Background knowledge}, author={Fanaee-T, Hadi and Gama, Joao}, }

[2] https://rpubs.com/saitej09/bikesharing

[3] https://rpubs.com/yroy/bike

[4] http://brandonharris.io/kaggle-bike-sharing/